Repository Structure:

benchmark_datasets & evaluation_code 
├── INSTRUCTION.md                    # Usage instructions
├── evaluate_benchmark.py             # Evaluation script
├── prompt.py                         # Prompt templates
├── evaluator/
│   ├── __init__.py                   # Package initializer
│   ├── critic.py                     # Critic model evaluation interface
│   └── llm.py                        # LLM evaluation interface
└── benchmark_query/
    ├── benchmark_all.jsonl           # Full dataset (1,000 queries)
    └── requirement/
        ├── style/
        │   ├── style_subset.jsonl     # Requirement-involved subset for style
        │   └── style_subset_C.jsonl   # Category-specific subset for style
        ├── format/
        │   ├── format_subset.jsonl    # Requirement-involved subset for format
        │   └── format_subset_C.jsonl  # Category-specific subset for format
        └── length/
            ├── length_subset.jsonl    # Requirement-involved subset for length
            └── length_subset_C.jsonl  # Category-specific subset for length
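
The files under `benchmark_query/` are plain JSONL, one query per line. Below is a minimal sketch of loading them; the per-record field names are an assumption rather than something this README specifies, so consult INSTRUCTION.md for the authoritative schema.

```python
import json

# Minimal sketch: load benchmark queries from a JSONL file.
# The structure of each record is an assumption; see INSTRUCTION.md
# in the repo for the actual field names.
def load_jsonl(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

queries = load_jsonl("benchmark_query/benchmark_all.jsonl")
print(f"Loaded {len(queries)} queries")  # the full set contains 1,000 queries
```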


GitHub Repo: https://github.com/X-PLUG/WritingBench
Leaderboard: https://huggingface.co/spaces/WritingBench/WritingBench
Critic Model: https://huggingface.co/AQuarterMile/WritingBench-Critic-Model-Qwen-7B
Writing Model (Qwen-2.5-7B-filtered): https://huggingface.co/AQuarterMile/Writing-Model-Qwen-7B
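
For reference, here is a hedged sketch of loading the critic model with the standard Hugging Face `transformers` API. The actual scoring prompt and response parsing live in `prompt.py` and `evaluator/critic.py`; the prompt string below is only a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch only: load the critic model with stock transformers APIs.
# The real critic prompt template and score parsing are defined in
# prompt.py and evaluator/critic.py; this placeholder does not reproduce them.
model_id = "AQuarterMile/WritingBench-Critic-Model-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

prompt = "..."  # fill in with the critic prompt template from prompt.py
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```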